arXiv:1507.04396vl [cs.NA] 15 Jul 2015 


Parallel MMF: a Multiresolution Approach to Matrix 

Computation 


Risi Kondor*^ Nedelina Teneva* and Pramod K. Mudrakarta* 

*Department of Computer Science, ^Department of Statistics 
{risi, nteneva, pramodk:m}@cs . uchicago . edu 
The University of Chicago 

Abstract 

Multiresolution Matrix Factorization (MMF) was recently introduced as a method 
for finding multiscale structure and defining wavelets on graphs/matrices. In this 
paper we derive pMMF, a parallel algorithm for computing the MMF factoriza¬ 
tion. Empirically, the running time of pMMF scales linearly in the dimension for 
sparse matrices. We argue that this makes pMMF a valuable new computational 
primitive in its own right, and present experiments on using pMMF for two distinct 
purposes: compressing matrices and preconditioning large sparse linear systems. 


1 Introduction 

While the size of machine learning datasets continues to increase at the rate an order of magnitude or 
more every few years, the clock speed of commodity hardware has all but plateaued. Thus, increas¬ 
ingly, real world learning problems can only be tackled by algorithms that implicilty or explicitly 
exploit parallelism. 

Broadly speaking, there are two main approaches to parallelizing the optimization problems at the 
heart of machine learning algorithms. In the data parallel model, the dataset is divided into batches 
(shards) that are processed independently by separate processing units, but all processing units have 
(synchronous or asynchronous) access to a central resource that stores the current values of all the 
optimization variables, i.e., the model’s parameters IT]. In the parameter parallel approach, all 
processing units operate on the same data, but each one of them optimizes only a subset of the 
parameters Elia. Data parallel algorithms often exploit the assumption that data subsets are i.i.d. 
given the model parameters. However, such trivial parallelization is not suitable for model parallel 
algorithms, since it fails to capture the often non-trivial interactions between parameters and so 
taking independent parameter subsets could lead to flawed estimates. On the other hand, model 
parallelization is often achieved by course graining the parameter space (e.g., topological order or 
imposing some constraints on the graph induces by the parameter interaction), which might not be 
sufficient to discover the finer and more complex interactions between the parameters. 

In contrast, in this paper we advocate a multiresolution approach to parallelism, in which both the 
data and the parameters are parallelized, and communication between processing units is minimized. 
At the finest level of resolution, the data is divided into many shards, which are processed indepen¬ 
dently, determining the “high frequency” parameters of the model. These parameters are local to 
each shard, obviating the need for a central parameter server. Additionally, the data in each shard is 
compressed, so when it is redistributed across the cores for the second level, each shard becomes a 
compressed sketch of a larger subset of the original dataset, making it possible for the second level 
processing units to “learn” parameters that capture lower frequency, less localized features. Iterat¬ 
ing this process leads to a multi-level algorithm that simultaneously uncovers the structure of the 
original data, and fits a model. 

The specific algorithm that we focus on in this paper is a parallelized version of the Multiresolu¬ 
tion Matrix Factorization (MMF) process first described in 0. By itself, MMF is not a learning 
algorithm. However, parallel MMF is a gateway to efficiently performing a range of fundamental 
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computational tasks, such as matrix compression, inversion, and solving linear systems. These tasks 
are critical building blocks of most learning algorithms. Due to space restrictions, the bulk of our 
numerical experiments, as well as the implementation details of our algorithm, which are critical to 
making pMMF scale to large problems, are relegated to the Supplement. 

Notations. In the following, [n] will denote the set {1, 2,..., n}. Given a matrix A G and two 
(ordered) sets Si,S2 C [n], ,53 "'ill denote the | Si |x| S2 1 dimensional submatrix of A cut out by 

the rows indexed by and the columns indexed by S'2. Si will denote [n]\ 5 'i. BiUB'^.. .UBm = [n] 
denotes that the sets Bi,..., Bm form a partition of [n]. A.^i or [A].^i denotes the Fth column of A. 


2 Parallel Multiresolution Matrix Factorization 


The Multiresolution Matrix Factorization (MMF) of a symmetric matrix A G is a multi-level 
factorization of the form 

A«Q7...Ql_iQliFQiQi_i...Qi, (1) 

where Qi,..., is a sequence of carefully chosen orthogonal matrices (rotations) obeying a num¬ 
ber of constraints; 

1. Each is chosen from some subclass Q of highly sparse orthogonal matrices. In the simplest 
case, Q is the class of Givens rotations, i.e., orthogonal matrices that only differ from the identity 
matrix in four matrix elements 


[Qe]i,i = cos 0 , [Qe]i,i = - sin 0 , 

— sin^, — cos^, 

for some pair of indices (z, j) and rotation angle 0. Slightly more generally, Q can be the class 
of so-called k-point rotations, which rotate not just two, but k coordinates, (zi,..., z^). 

2. The effective size of the rotations decreases according to a set schedule n = 60 > Si > ... > 6 l, 
i.e., there is a nested sequence of sets [n] = Sq ^ Si ^ A Sl with 15^1 = Sg such that 

S7T ^ ~ dimensional identity. Sg is called the active set at level £. In the 
simplest case, exactly one row/column is removed from the active set after each rotation. 

3. H is SL-core-diagonal, which means that it is all zero, except for (a) the submatrix [H]s^,Sl 
called the core and (b) the rest of the diagonal. 

Moving the rotations in ([T]) over onto the left hand side, the structure implied by the above conditions 
can be represented graphically as 


,)W"(\)-(\) 

Ql Q 2 Qi ^ Qi Q 2 Ql ^ 

( 2 ) 

Here, for ease of visualization, A has been conjugated by a permutation matrix P, which ensures 
that Si = {1,..., 5^} for each 1. However, an actual MMF would not contain such an explicit 
permutation. In general, MMF is only an approximate factorization, because there is no guarantee 
that using a given number of rotations A can be brought into core-diagonal form with zero error. 
Most MMF factorization algorithms try to find Qi,... ,Ql H so as to minimize the squared 
Frobenius norm of the difference between the l.h.s and r.h.s of (|^, called the residual. 

The original motivation for MMF in H was to mimic the structure of fast orthogonal wavelet trans¬ 
forms. For example, when A is the Laplacian matrix of a graph Q and U = Ql ■ • • Q 2 Q 1 , the rows 
of U are interpreted as wavelets on the vertices of Q. Those rows whose indices lie in So\Si are set 
by the first rotation matrix Qi, and are not modified by the rest of the rotations. (If Qi is a Givens 
rotation, there is only one such row, but for the more complicated rotations there might be several.) 
These rows are very sparse, and provide the lowest level, most local, highest frequency wavelets. 
The rows indexed by 5'i\5'2 are determined by Q 2 Q 1 , and correspond to level 2 wavelets, and so 
on. Writing the MMF as ^ « U^HU suggests that the rows of U can also be interpreted as a 
hierarchically sparse PCA basis for A. Finally, since the size of the active set decreases after each 
rotation, defining Ai = Qi..., QiAQj ... Qj, the sequence of transformations 



A — Aq I —y Ai I —y A 2 I —y ... I—^ Al I —y H 


( 3 ) 
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is effectively a matrix compression scheme, which, according to i] , often significantly outperforms, 
for example, Nystrom methods. MMF is closely related to Diffusion Wavelets a and Treelets 0. 

Due to the above properties, MMF is an attractive tool for uncovering the structure of large datasets. 
However, it has one fundamental limitation, which is its computational cost. The greedy factoriza¬ 
tion algorithm described in a essentially follows the sequence of transformations in 0, at each 
level choosing Qi and a set of rows/columns to be eliminated from the active set, so as to minimize 
their contribution to the final approximation error. Assuming the simplest case of each Qi being a 
Givens rotation, this involves (a) finding the pair of vertices {i,j) involved in Q^, and (b) finding 
the rotation angle 9. In general, the latter is easy. However, finding the optimal choice of (i, j) (or, 
in the case of fc-point rotations, (ii,..., ife)) is a combinatorial problem that scales poorly with n. 
The first obstacle is that the optimization is based on inner products between columns, so it requries 
computing the Gram matrix at a complexity of O(n^). Note that this need only be 

done once: since rotations act on Ge the same way that they act on A, once we have Gi = A^A, each 
subsequent Gg can be efficiently derived via the recursion Gg+i = QiGiQj. The second obstacle 
is that searching for the optimal (ii,..., ik) has complexity 0 {n^). 

The objective of the present paper is to construct a parallel MMF algorithm, pMMF, which, on 
typical matrices, assuming access to a sufficient number of processors, runs in time close to linear 
in n. The ideas behind the new algorithm exploit the very structure in A that MMF factorizations 
pursue in the first place, namely locality at multiple levels of resolution. 

2.1 Clustering 

The first and crucial step towards pMMF is to cluster the rows/columns of A into m clusters and 
only consider rotations between /c-sets of rows/columns that belong to the same cluster. Letting 
be the indices of the rows/columns belonging to cluster u, clustering has three immediate benefits: 

1. Instead of having to compute the full Gram matrix G = A^A, it is sufficient to compute the local 

Gram matrices Assuming that the clustering is even, i.e., |i3„| = 0(c) for 

some typical cluster size c (and therefore, m = 0(n/c)), this reduces the overall complexity of 
computing the Gram matrices from 0{rA) to 0{mc^n) = O(cn^). 

2. The complexity of the index search problem involved in finding each Qf is reduced from 0(n^) 
to 0{c^). In typical MMFs 5^ — 0{n), and the total number of rotations, L, scales linearly with 
11 . Therefore, the total complexity of searching for rotations in the unclustered case is 
whereas with clustering it is O(c^n). 

3. The Gram matrices and the rotations of the different clusters are completely decoupled, therefore, 

on a machine with at least m cores, they can be computed in parallel, reducing the computation 
time of the above to 0 {cm?/m) = 0 {c^n) and 0 {c^n/m) = respectively. 

Using the clustering Hi U i ?2 W ... U = \n] results in an MMF in which each of the matrices 
(and hence, also their product) are (Hi,..., Hm)-block-diagonal, as defined below. 

Definition 1 Given a partition H 1 UH 2 U.. .UH^, of[n], we say that M G is (Hi,..., Bm)- 

block-diagonal if Mi j = 0 unless i and j fall in the same cluster B^for some u. 

Clustering, in the form described above, decouples MMF into m completely independent subprob¬ 
lems. Besides the inherent instability of such an algorithm due to the vagaries of clustering algo¬ 
rithms, this approach is antithetical to the philosophy of multiresolution, since it cannot discover 
global features in the data: by definition, every wavelet will be local to only one cluster. The natu¬ 
ral solution is to soften the clustering by repeatedly reclustering the data after a certain number of 
rotations. Writing out the full MMF again and grouping together rotations with the same clustering 
structure 

A ~ Qi ■ ■ ■ QJ^ Qh+i ■ ■ • QJ 2 . QIp H Qip^... .. . Q 12 ■. . ^Qi^+iQi^ .Qi 

Qp 

results in a factorization 

A K, Q 2 ... QpH Qp ... Q 2 Q 1 , (A) 

where each Qp, which we call a stage, is now a product of many Qg elementary rotations, all 
conforming to the same block diagonal structure {Bf ,..., B^^). Note that the number of clusters. 
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Algorithm 1 pMMF (top level of the algorithm) 

Input: a symmetric matrix A G 
Set Aq — A 
for (p = 1 to P) { 

cluster the active columns of Ap_i to get (B^, ..., B^) 

reblock Ap_i according to (Bf,..., B^) 

for (u = 1 to to) Qp „ ^ FmdRotationsInCluster(p, u) 

for (u= 1 tOTO) for(i;= 1 tOTO.) set |Ap]^_^ ^ Qp_„ |Ap_i]^_^ Qp^ 
merge (Qp i,..., Qp_„) into Qp 

} 

H G- the core of Al plus its diagonal 
Output: (P,Qi,...,Qp) 


TOp, might not be the same across stages. In particular, since the active part of Ai progressively gets 
smaller and smaller, nip will usually decrease with p. The top level pseudocode of pMMF, driven 
by this repeated clustering process, is given in Algorithmic 


2.2 Randomized greedy search for rotations 


The second computational bottleneck in MMF is hnding the k rows/columns involved in each ro¬ 
tation. To address this, we use a randomized strategy, whereby hrst a single row/column ii is 
chosen from the active set (within a given cluster) unformly at random, and then k — 1 further 
rows/columns 12 , ■ ■ ■ ,ik are selected from the same cluster according to some separable objective 
function (j){i 2 , ■ ■ ■ ,ik) related to minimizing the contribution to the final error. For simplicity, in 
pMMF we use 


k 

(j){i2,...,ik) = 

r=2 


II [^^-l]:.vll 


i.e., is rotated with the k—1 other columns that it has the highest normalized inner product 

with in absolute value. Similarly to il, the actual rotation angle (or, in the case of k’th order 
rotations, the kx k non-trivial submatrix of Qi) is determined by diagonalizing 
at a cost of only 0{k^). This aggressive randomized-greedy strategy reduces the complexity of 
hnding each rotation to 0(c), and in our experience does almost as well as exhaustive search. The 
criterion for elimination is minimal off-diagonal norm, || A;_i||off.dmg = (II because 

2 II is the contribution of eliminating row/column i to the hnal error. 


2.3 Blocked matrices 

In a given cluster m of a given stage p, the local Gram matrix G“, and the rotations can be determined 
from the columns belonging to just that cluster. However, subsequently, these rotations need to be 


Algorithm 2 FindRotationsInCluster(p, u) — here k = 2 and 77 is the compression ratio 

Input: a matrix A = [Ap]. S made up of the c columns of Ap_i forming cluster u in Ap 
compute the Gram matrix G = A^A 
set I = [c] (the active set) 
for (s = 1 to [pcj){ 
select iG I uniformly at random 
find j = argmaxp\{ij | | /||A:j|| 

find the Givens rotation Qs of columns {i,j) described in the text 
set A G- qsAqJ 
set G G- QsGq^ 

lloff-diag< II Ad llotf-diag eliminate i ; otherwise eliminate j 

} 

Output. Qp^^ ■ ■ ■ q 2 qi 
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Figure 1: Schematic of a blocked matrix M with 5x5 blocks. For the sake of visual clarity, we 
assumed that the blocks are contiguous, but, in general, this is not the case. The reblocking process 
involves first reorganizing the rows acoording to the new structure, then reorganizing the columns. 
To perform this efficiently, the first operation is done in parallel for each column of blocks of the 
original matrix, and the second operation is done in parallel for each row of blocks. 


applied to the entire matrix, from both the right and the left, which cuts across clusters. To be able to 
perform this part of the algorithm in parallel as well, we partition A not just column-wise, but also 
row-wise. The resulting data structure is called a symmetrically blocked matrix (c.f., Q). 

Definition 2 Given a matrix A S and a partition i?i U ^2 kJ... U of n, the {u, v) block of 

A is the submatrix |^]„ „ := Ab^,b„- The symmetric blocked matrix form of A consists of the mf 
separate matrices 

In pMMF, the matrix Ai is always maintained in blocked matrix form, where the block structure 
is dictated by the clustering of the current stage. For large matrices, the individual blocks can be 
stored on separate cores or separate machines, and all operations, including computing the Gram 
matrices, are performed in a block-parallel fashion. This further reduces the time complexity of the 
Gram matrix computation from 0{n<f‘) to O(c^). Assuming rrip-fold parallelism, and a total of pc 
rotations in stage p, the overall time needed to apply all of these rotations to the entire matrix scales 
with 0 {rjkc^). 

The blocked matrix data structure is ideally suited to carrying out each stage of MMF on a parallel 
system, because (except for summary statistics) no data needs to be communicated between the 
different blocks. However, changing the block structure of the matrix from one clustering to another 
can incur a large communication overhead. To retain m-fold parallelism, the reblocking is carried 
out in two phases: first, each column of blocks is reblocked row-wise, then each row of blocks in 
the resulting new blocked matrix is reblocked column-wise (Figure[T]l. 


2.4 Sparsity and matrix-free MMF arithmetic 

Ultimately, pMMF is intended for factoring matrices that are sparse, but whose dimensionality is 
in the hundreds of thousands or millions. As the factorization progresses, the fill-in (fraction of 
non-zeros) in Ap will increase, but at the same time, the active part of An will progressively shrink. 
This means that in practice, given a sufficiently highly parallel system, the overall complexity can 
still scale roughly linearly with the number of non-zeros in A. 

The complete factorization appearing on the r.h.s. of ([^ we denote A. Storing A by storing H and 
the {Qn} matrices separately, the space complexity scales roughly linearly in the dimension. On the 
other hand, computing A explicitly as a dense matrix is usually unfeasible. Therefore, when 

applying the computed factorization, for example, as a preconditioner (Section |^, which requires 
repeatedly multilying a vector v by A, we use the so-called matrix-free approach: v is stored in 
the same blocked form as A, the rotations are applied individually, and as the different stages are 
applied to v, the vector goes through an analogous reblocking process to that described for A. The 
complexity of matrix-free MMF/vector multiplication is 0{kpn). Inverting A, which is also critical 
for downstream applications, involves inverting the entries on the diagonal of H and inverting the 
core matrix, thus the overall complexity of MMF inversion Oin + 5\). 

The theoretical complexity of the main components of pMMF are summarized in Table[T] Of course, 
requiring m^-fold parallelism as m —>^ oo is an abstraction. Note, however, that even the total 
operation count scales with which is just the number of non-zeros in the original matrix. The 
plots in Figure and similar plots in the Supplement confirm that on many real world datasets. 
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Figure 2; Wall clock time of pMMF in seconds as a function of the number of non-zeros, 771 ^, in 
A on a 32 core 2.6 GHz machine. A is derived from the Laplacian of standard benchmark sparse 
graphs (see Supplement), taking submatrices of different sizes. The hgures conhrm that in practice 
pMMF often scales linearly in the number of nonzeros, hence, for bounded degree graphs, also in n. 

particularly, matrices coming from sparse network graphs, the wall clock time of pMMF tends to 
scale linearly with the dimension. Also note that in these experiments n is on the order of 10"^ ^ 10®, 
yet the factorization time is on the order of just one minute. 

3 pMMF Compression 

Most, if not all, machine learning algorithms reduce to linear algebra operations or optimization over 
large instance/feature matrices or instance/instance similarity matrices. The classical example is, of 
course, kernel methods, which reduce to convex optimization involving the so-called Gram matrix, 
a symmetric positive semi-definite (p.s.d.) matrix of size nx n, where n is the number of training 
examples. Despite their many attractive properties and their large literature, the applicability of 
kernel methods to today’s large datasets is limited by the fact that “out of the box” their computation 
time scales with about n®. This issue (not just for kernel methods, but more broadly) has catalyzed 
an entire area focused on compressing or “sketching” matrices with minimal loss. 

In the symmetric (and p.s.d) case, most sketching algorithms approximate A G in the form 

A = where C is a judiciously chosen matrix with m^n, and is computed 

by taking the pseudo-inverse of a certain matrix that is of size only m x m. The algorithms mainly 
differ in how they define C: 

(i) Projection based methods set each column of C to be a random linear combination of the 
columns of the original matrix A, and use Johnson-Lindenstrauss type arguments to show that 
the resulting low dimensional random sketch preserves most of the information at least about 
the part of A spanned by its high eigenvalue eigenvectors |0. These methods come with strong 
guarantees, but suffer from the cost of having to compute m dense linear combinations of the 
columns of A. Even if A was sparse, this process destroys the sparsity. 

(ii) Structured projections are a twist on the above idea, replacing the random projection with a 
fixed, dense, but efficiently computable, basis transformation (and subsampling in that basis), 
such as the fast Hadamard transform or the fast Fourier transform ||9l [TOl . 

(iii) In contrast to (i) and (ii), Nystrom algorithms construct C by choosing a certain number of 
actual columns of A, usually by random sampling. Here, the focus has shifted from uniform 
sampling imEi, via ( 2 -norm sampling m, to sampling based on so-called leverage scores 
OllElIIll, which, for low rank matrices, can be shown to be optimal. Further recent develop¬ 
ments include the ensemble Nystrom method iflTl and the clustered Nystrom algorithm IfTSl . 
Finally, a number of adaptive algorithms have also been proposed QSlEolIII]. 

pMMF can also be regarded as a sketching method, in the sense that Q compresses A into an 
m:=5L^n dimensional core via a very fast series of orthogonal transforms Ql ■ ■ ■ Q 2 Q 1 , which 
play a role analogous to CT In contrast to the other methods, however, MMF also retains the entries 
on the diagonal of H. From the point of view of downstream computation, this incurs very little 
extra cost. For example, if the purpose of sketching A is to compute its inverse, maybe for use in a 
Gaussian process or ridge regression, then A~^ can be easily computed by inverting the core of H 
(as in the Nystrom methods), and just taking the inverse of all other matrix elements on the diagonal. 

The main distinction between pMMF and other matrix sketching methods is that while the latter, 
implicitly or explicitly, make the assumption that A is low rank, or close to low rank, MMF makes 
a different structural assumption about A, namely that it has hidden hierarchical or multiresolution 
structure. Figure shows the results of experiments comparing the perfomance of MMF to other 
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dimension of compressed matrix 




Figure 3: The Frobenius norm error || A — A Ijprob of compressing matrices with pMMF vs. other 
sketching methods, as a function of the dimension of the compressed core. In each figure, the error 
is normalized by \\ A — Ijprob^ where A]~ is the best rank k approximation to A. The four datasets 
are “HEPph”, “AstroPh”, “CondMat” and “Gisette”, with k = 100 in the first three and fc = 12 in 
“Gisette”. 


matrix sketching algorithms on some standard datasets. As in panes 1-3, on most datasets that we 
tried, pMMF significantly outperforms the other sketching methods in both Frobenius norm error 
and spectral norm error (plots of the latter can be found in the Supplement). The advantage of 
pMMF seems to be particularly great on network graphs, perhaps not surprisingly, since it has long 
been conjectured that networks have multiresolution structure ll22l l23l l24l . However, we find that 
pMMF often outperforms other methods on kernel matrices in general. On the other hand, on a 
small fraction of datasets, typically those which explicitly have low rank or are very close to being 
low rank (e.g., the fourth pane of Figure]^, pMMF performs much worse than expected. In such 
cases, a combination of the low rank and multiresolution approaches might be most advantageous, 
which is the subject of ongoing work. 

It is important to emphasize that pMMF is very scalable. Many other Nystrom methods are im¬ 
plemented in MATLAB, which limits the size of datasets on which they can be feasibly ran on. 
Moreover, leverage score methods require estimating the singular vectors of A, which, unless A 
is very low rank, can be a computational bottleneck. Several of the Nystrom experiments took 30 
minutes or more to run on 8 cores, whereas our custom C-H- pMMF implementation compressed 
the matrix in at most one or two minutes (see more timing results in the Supplement). Hence, 
pMMF addresses a different regime than many other Nystrom papers: whereas the latter often focus 
on compressing ^ 10^ dimensional matrices to just 10-100 dimensions, we are more interested in 
compressing ~ 10^-10® dimensional matrices to ^ 10^ dimensions. 


4 pMMF Preconditioning 


Given a large matrix A G and a vector b G K", solving the linear system Ax = b is one 

of the most fundamental problems in computational mathematics. In the learning context, solving 
linear systems is the computational bottleneck in a range of algorithms. When A is sparse. Ax = b 
is typically solved with iterative methods, such as conjugate gradients. However, it is well known 
that the number of iterations needed for such methods to converge scales badly with k, where k is 
the ratio of the largest and smallest eigenvalues of A, called the condition number. 


The idea of preconditioning is to solve instead of Ax = b the related system {M~^A) x = M~^b, 
where M~^ is an easy to compute rough approximation to A~^. A good preconditioner will ensure 
that M~^A is fast to multiply with the current vector iterate, while the condition number of M~^A 
is much better than that of the original matrix A. At the same time, it is important that M~^ be 
easily computable for massive matrices. For symmetric matrices, a variation on the above is to solve 
2 / = and then set x = which retains symmetry. pMMF is a 

natural candidate preconditioner for symmetric matrices since (a) the pMMF factorization can be 
computed very fast (b) as evidenced by the previous section, A is a good approximation to A, (c) 
(with ^ G {1,1/2}) can be computed from A in just 0(n + i5£) time. However, unlike some 
other preconditioners, is generally not sparse. Therefore, in MMF preconditioning one never 


expands A ^ into a full matrix, but rather A ^ is applied to the vectors involved in the iterative 


method of choice as a sequence of rotations, as described in Section 2.4 


A large number of different preconditioners have been proposed in the literature, and even for a given 
type of problem there is often no single best choice, rather the choice reduces to experimentation. 
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Figure 4: Residual as a function of conjugate gradient iterations when solving Ax = b, where b is 
a dense random vector. The indicated times are the wall clock time to convergence on an 8 core 2.6 
GHz machine. The per-iteration time of pMMF preconditioning is usually 2-10 times faster than of 
other methods. pMMF is indicated in red. 


In our experiments our goal was to show that pMMF preconditioning can improve the convergence 
of linear solvers in learning problems, and that it is competitive with other preconditioners. Fig¬ 
ure compares the performance of pMMF as a preconditioner to other standard preconditioners, 
such as incomplete Cholesky and SSOR. Several more preconditioning results are presented in the 
Supplement. In summary, pMMF appears competitive with other preconditioners on network and 
kernel matrices, and sometimes outperforms other methods. In each of our experiments, the time 
required to compute the pMMF preconditioner was less than a minute, which is amortized over the 
number of linear solves. We are still experimenting with how much pMMF as a preconditioner can 
be improved by fine tuning its parameters, and how well it will perform coupled with other solvers. 


5 Conclusions 

The most common structural assumption about large matrices arising in learning problems is that 
they are low rank. This paper explores the alternative approach of assuming that they have multires¬ 
olution structure. Our results suggest that not only is the multiresolution model often more faithful 
to the actual structure of data (e.g., as evidenced by much lower approximation error in compres¬ 
sion experiments), but it also lends itself to devising efficient parallel algorithms, which is critical 
to dealing with large scale problems. Our approach bears some similarities to multigrid methods 
E 6 \ and structured matrix decompositions ll27l l28l |29]| . which are extremely popular in applied 
mathematics, primarily in the context of solving systems of partial differential equations. A crucial 
difference, however, is that whereas in these algorithms the multiresolution structure is suggested 
by the geometry of the domain, in learning problems the structure itself has to be learnt “on the fly”. 
Empirically, the pMMF algorithm described in this paper scales linearly in the size of the data. Fur¬ 
ther work will explore folding entire learning and optimization algorithms into the multiresolution 
framework, while retaining the same scaling behavior. 


serial MMF pMMF operations pMMF time 



dense 

sparse 

dense 

sparse 

dense 

sparse 

^proc 

Computing Grams 

O(n^) 

0 ( 7 n^) 

0 {pcn^) 

0 {'fpcn'^) 

0 {pc^) 

olipc^) 

-- 

Finding Rotations 

0(n3) 

0(n3) 

0 {cn) 

0 {cn) 

0 {^) 

0 {^) 

m 

Updating Grams 

0(n3) 

0 ( 7 ^n^) 

0 {c^n) 


0(c3) 

0(72c3) 

m 

Applying rotations 

Oikv?) 

0 {-fkn?) 

0 {kri^) 

0 {'-fkn^) 

0 {kc^) 

0 {'^kc^) 


Clustering 



0 {pmn^) 

0 {'jpmn^) 

0 {pcn) 

0 {'jpcn) 


Reblocking 



0 {pn?) 

0 {'^P'n?) 

0 {pcn) 

0 {'jpcn) 

m 

Factorization total 

O(n^) 

0 {n^) 

0 {pcn^) 

0 {jpcn^) 

0 {pc*) 

0 {jpc^) 



Table 1: The rough complexity of different subtasks in pMMF vs. in the original serial MMF 
algorithm of n. Here n is the dimensionality of the original matrix. A, k is the order of the 
rotations, and 7 is the fraction of non-zero entries in A, when A is sparse. We neglect that during the 
course of the computation 7 tends to increase, because concomitantly Af, shrinks, and computation 
time is usually dominated by the first few stages. We also assume that entries of sparse matrices can 
be accessed in constant time. In pMMF, p is the number of stages, m is the number of clusters in 
each stage, and c is the typical cluster size (thus, c = 6 {n/m)). The “pMMF time” columns give 
the time complexity of the algorithm assuming an architecture that affords Wproc-fold parallelism. 
5 = is the size of the dense core in H. It is assumed that k<p<c<n, but n = o(c^). 
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